Introduction

The Transputer was too far ahead of its time. Update the clock speeds, and the
architecture would be impressive today. It was a microcomputer, with CPU, memory,
and I/O on one chip. External logic required was minimal. Large arrays of Transputers
were easily implemented. However, like many advanced technological artifacts, it was
hard to understand. It took a while to get used to the software approach, and the tools
were difficult to use. In fact, the software approach, the conceptual model, was what
made the Transputer powerful; the implementation in silicon came later. You had to
understand and buy into the conceptual model, and then the software, to maximize your
return from the Transputer. A steep learning curve was involved. In the end, the
Transputer was overtaken by simpler, better-funded, mainstream approaches.

The Inmos Transputer architecture was introduced in 1985 as a single chip 
microcomputer architecture, optimized for parallel use in Multiple Instruction, Multiple 
Data (MIMD) configurations. It provided excellent and balanced interprocessor 
communications as well as computational ability. Transputers provided the capability to
implement scalable systems. Each was truly a system-on-a-chip, with processor, memory,
and I/O. Timers were included internally, and the chip required merely a crystal to derive
its own clock. The timers enabled real-time programming and process scheduling.
Because the input clock was 5 MHz regardless of the internal rate of the Transputer, a 
master clock could be distributed across a board. 

The architecture of the Transputer from Inmos Corporation can best be understood by 
looking at the origins of the device. The Transputer family was a British design. In 
essence, the Transputer implements in hardware the parallel language Occam. Emphasis
was placed on communication between processes as well as computational ability. To
understand the Transputer architecture you have to understand Occam and
communicating parallel processes. However, to use the Transputer, you could program in
C, Pascal, Fortran, or several other familiar languages. Occam remained a barrier to
widespread acceptance of the Transputer, but it was the most efficient software tool,
the one that best matched the hardware. Most working system designers prefer to use
the most familiar tools, and accept a suboptimal solution.

After its introduction in 1985, the Transputer went on to lead the world in units
shipped among RISC processors in 1989 and 1990. By the end of 1990, over half a
million Transputer units had been shipped, which translated into a large
installed base, even considering that a large number of units went into embedded
applications.

Inmos was founded by the British Government in 1978, and a U.S. headquarters was
established in Colorado Springs. In 1984, Inmos was purchased by Thorn EMI. It was later
sold to SGS-Thomson, a major semiconductor manufacturer and systems developer
in Europe, which also manufactured graphics chips and memories. Fabrication and
manufacturing facilities were located in Europe and the United States. A military 
products line provided Transputer family parts in compliance with MIL-STD-883. As 
single chip microcontrollers, Transputers have flown on space missions. 

As controllers, Transputers provided an excellent approach, as they included cpu, 
memory, and I/O in one package. A minimum of external components was required for 
systems. Using their unique interprocessor communication architecture, Transputers were 
the ideal building block for parallel systems. 

The Transputer processor family consisted of the 16-bit T222 and T225, and the 32-bit 
T400, T414, T800, T801, T805. All were microcomputers, with ALU, memory, and I/O 
in one package. The early T200 devices were 16-bit architectures, with subsequent
models being true 32-bit. The T400 series of Transputers included 32-bit integer
processors, 4 kilobytes of internal memory, and high-speed serial links. The T-800 family
added an integral 64-bit floating-point processor in 1987. The T-9000 series expanded the
capabilities of the device by a factor of 10 in computation and communication.

The T800 Transputer could achieve 30 MIPS peak and 15 MIPS sustained, and 4.3 Mflops
peak and 2.25 Mflops sustained. Not bad for the day: it was twice the performance of
Intel's 80386/80387 pair. Low power operation was also a feature; the T-800 drew only
0.75 watts.

The T-801 Transputer was a variation of the T-800 with a non-multiplexed address/data 
bus. Thus, the package size (i.e., pin count) was larger than the standard T-800. It was 
easier to interface directly with external static memory than the T-800. The T-805 was a 
T-800 with some additional debug instructions, and with several additional control
signals to ease the design of dynamic memory systems in a DMA environment. In addition,
the T-805 supported 2-dimensional graphics via a new set of instructions, and CRC
calculations on arbitrary length data streams. Like the T-800, it used multiplexed data
and address lines.



The M212 was a special purpose Transputer device for peripheral control. It was a 
derivative of the T2xx 16-bit architecture, with specific interface logic for disk drives. It 
allowed disk drives to become nodes on a network of link-connected Transputers. 

The Transputer featured high speed interconnect by means of full duplex asynchronous 
serial communications. Associated communication interface devices included the C011
and C012, which interfaced parallel data transfers to the two-wire link protocol. The
C004 device was a 32 x 32 crossbar switch for links. The C004 was fully programmable, 
dynamically switchable, and controlled by a link interface. 

In addition to the general purpose T-800 series, Inmos manufactured a series of special 
purpose processor units. These included the A-100 Cascadable Signal Processor, the
A-110 Image and Signal Processing subsystem, the A-121 2-D discrete cosine transform
image processor, and the STI3220 Motion Estimation Processor.

The follow-on processor to the T-800 was announced by Inmos in April 1991, and was 
called the T-9000. This unit was code-compatible with existing T-800 units, but provided 
an order-of-magnitude performance increase in both computational and communication 
capability. The unit incorporated a 32-bit integer processor, 64-bit floating-point 
processor, 16 kilobytes of internal memory, four upgraded I/O links, and two 
configuration links on a single chip. In addition, the interrupt and memory interface were 
improved. The new processor chip was accompanied by the C104 packet routing chip,
and the C100 system protocol converter, which translated between T-800 class link
communication and the T-9000 scheme. The T-9000 appeared in spring 1993. There
were rumors of a follow-on 0.5 micron processor code named El. 

The T-9000 chip maintained binary code compatibility with the T-800, and could be 
mixed in systems with the earlier processor. However, both the internal processor 
instruction rate and the link communication rate had been enhanced. The T-9000 was a 
superscalar, pipelined architecture, with extensive on-chip cache. 

The Transputer family was derived as the instantiation of the Occam language, developed 
at Oxford University. 

Architecture 

The Transputer was a fast single chip microcomputer requiring a minimum of external 
support chips. This low cost unit performed fast, on-chip, 64-bit floating point processing 
and had built-in support for parallelism. The key to the understanding of the architecture 
is the language Occam. 

Hardware 

The Transputer itself was a small (1 sq. inch) 84-pin chip which had a high degree of
functional integration. External support requirements were minimal. The Transputer even
had 4k bytes of fast RAM on-chip so that minimal systems could be built with no
external memory. This memory was located at the base (low address end) of the memory
space. On reset, execution began at the top of memory. The Transputer supported very
fast on-chip floating point and had four bi-directional serial links built into the chip,
operating at a DMA rate of 20 Mbps each. These links allowed Transputers to be
connected as building blocks into arrays of arbitrary size and complexity. The advantage
of the Transputer architecture lay not only in its computation speed, but also in its I/O
capacity. A reasonable balance of processing to I/O could be configured for a wide range
of applications.

The Transputer's four kilobytes of on-chip RAM could be allocated for cache, data, or
instructions, and simple programs executing entirely from on-chip RAM were very fast.
Four kilobytes may not seem like a lot of space, but recall that instructions were one byte.
On-chip RAM provided single-cycle access, while external memory required a minimum of
three cycles per access.

Instead of having a large number of registers on the chip, the Transputer was actually a
stack machine. This provided for a very fast context switch for interrupt response
and task switching. The three-deep operand stack corresponded to the 3-address instruction
format of other processors. The transistors, or silicon real estate, normally used for
scoreboarded registers were devoted to on-chip fast static RAM, which could be used for
code, data, or stack space. The external memory space was spanned by 32 address bits,
and was addressed in a flat model. The internal memory provided a fast-access
workspace. Only the workspace and instruction pointers were saved in a context switch.
For interrupts, the three stack registers also needed to be saved. The workspace pointer
register located the local variables in memory. Transputers accessed words, except for
byte and Boolean arrays.

The T-9000 featured a superscalar architecture to achieve 150 MIPS peak and 60 MIPS
sustained integer performance, and 20 Mflops peak and 10 Mflops sustained floating
point performance. It featured 80 megabytes/second input-output capability, with input
and output proceeding simultaneously. The T-800's 4k onboard memory was increased to
16 kilobytes, which could be used for code, data, stack, or cache. The external address
space of the processor was 4 gigabytes. The
processor achieved performance by using extensive pipelining. In excess of 3 million 
transistors were utilized in the device. 

A Transputer had a number of simple operating system functions built into the hardware. 
These included hardware multitasking with foreground and background priority levels, 
hardware timers, and hardware time-slicing of background tasks. I/O set-up was
extremely simple, typically requiring three instructions to initiate a DMA
read or write across a link, with the task automatically disabled until I/O completion
or timer expiration. Interrupt context switching was also very fast, typically less than
one microsecond. In addition, generated code was extremely compact, the most commonly
used instructions being only 1 byte long. An OPERATE instruction was included to
extend the instruction set. The arithmetic instructions included add, add constant,
subtract, multiply, divide, and remainder. A jump instruction and subroutine call/return
provided for transfer of control. The usual bit manipulation and shift/rotate instructions
were included, as was support for long (64-bit) arithmetic. The floating point instructions
on the advanced chip models added 64 extra opcodes for add, subtract, multiply, divide,
and normalize as well as shifts. Some models implemented a 2-dimensional block move,
clip, and draw. Instructions to calculate a CRC on a word, count bits, or reverse bits were
also provided. Scheduling opcodes were included for starting, running, and stopping a
process, as well as priority operations.

The Transputer could boot from ROM, or from a link, selectable by the state of a data 
pin. Booting from link allowed for fast initialization of a large collection of 
interconnected Transputers. 

A Transputer represented a computing resource with both integer and floating point
calculation capability, and with I/O resources of four links of 20 megabits/second input
and output simultaneously. Transputers could be connected in a variety of network
topologies. The links could be hard-wired, jumpered, or connected via the Inmos C004
32x32 programmable crossbar switch. Each link supported two channels in Occam.

The Transputer provided a scalable solution to processing requirements only now being 
matched by other architectures. The iWarp by Intel and the TMS320C40 parallel DSP
by Texas Instruments also put emphasis on communication and computation.

In summary, there was no contemporary processor that provided the same level of 
connectivity as the Transputer, and the Transputer was ahead in terms of Processor-I/O 
balance. The closest candidate was the Texas Instruments 'C40 Parallel DSP, which was 
marketed as a DSP, not a general purpose computer. The communication architecture of
the C40 used parallel ports with DMA engines, and was necessarily distance-limited. No
other architecture could be found that provided the Transputer's inherent communication
capability and connectivity without extensive glue logic. In fact, many emerging board
level systems used the Transputer as a communication element for fast RISC processors
such as Intel's i860 or Motorola 96000 series. The communication bandwidth of a system
using Transputer links went up linearly with the number of Transputers. These links were,
however, point-to-point, but they operated with no CPU involvement or overhead.
Connectivity was four nodes per Transputer.

A microcoded scheduler maintained time sharing between processes running on the 
hardware. Two priorities were provided. Besides the internal execution units, memory, 
and link I/O, several integrated functions were provided. Support for external DRAM
was provided by 17 different selectable timing configurations. DMA
handshaking was implemented, and an interrupt request and acknowledge were provided.

The Transputer had two 32-bit timers. The high priority process timer was incremented 
every microsecond. The low priority process timer was incremented every 64 
microseconds. Timers were used for process scheduling, and the current value could be 
read onto the stack with a load timer instruction. A Transputer's external memory
interface was von Neumann, using 32-bit wide address and data paths. No memory
management or virtual memory features were included. Upon reset, the Transputer could
boot from a ROM, or from one of the serial links. This feature was selected by the state
of a pin on the processor. When booting from ROM, one Transputer could initialize a
whole network of other Transputers by transferring the initial program serially over the
links. In the case of a boot ROM, control was passed to the top two bytes in the address
space. Transputer processors and I/O chips all used a simple 5 MHz system clock,
which was then phase-locked within the chip to the proper operating frequency. Processor
speed was selectable by three hardware pins on the T800 series.
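
As a rough illustration of the two timers described above, the following Python sketch
works out how long each 32-bit counter runs before wrapping, and shows one common way to
compare wrapping tick counts. The function names and the modular "after" comparison are
assumptions made for illustration; they are not taken from Inmos documentation.

```python
TIMER_BITS = 32
MASK = (1 << TIMER_BITS) - 1

def wrap_period_seconds(tick_microseconds):
    """Seconds until a 32-bit timer with the given tick length wraps."""
    return (1 << TIMER_BITS) * tick_microseconds / 1_000_000

def after(t1, t2):
    """True if tick count t1 is later than t2, allowing for wraparound.

    Assumption: times are compared modulo 2**32, and t1 counts as later
    when it is less than half a cycle ahead of t2.
    """
    diff = (t1 - t2) & MASK
    return 0 < diff < (1 << (TIMER_BITS - 1))

high = wrap_period_seconds(1)    # high priority timer: 1 microsecond tick
low = wrap_period_seconds(64)    # low priority timer: 64 microsecond tick
print(f"high priority timer wraps after {high / 60:.1f} minutes")
print(f"low priority timer wraps after {low / 3600:.1f} hours")
```

Under these assumptions the high priority timer wraps roughly every 72 minutes and the
low priority timer roughly every 76 hours, so a program that waits on a timer has to
compare times in a wrap-tolerant way.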

The Transputer links provided high speed serial communication between processors, and
between processors and the outside world. Physically, there were two signal lines per
link, input and output, plus ground. Each link provided two channels, one in each
direction, because each wire carried the data for one channel and the acknowledge for
the other. The link protocol transferred a message as a sequence of bytes, which implies
that the word lengths of the sender and receiver need not be the same. Acknowledgment
was on a per-byte basis. The incoming data was sampled at five times the bit frequency.

No JTAG support was included. Support for debugging included the analyze and error
pins, and a breakpoint instruction. When the analyze signal was asserted, the Transputer
halted after high priority processes completed. State was saved, and memory refresh
continued. A Transputer asserted the error pin upon detection of an internal error state,
such as overflow or division by zero. The associated error flag had to be specifically
cleared by an instruction. An error input pin was also provided, which was OR-ed with the
internal error flag. Thus, arrays of Transputers could pass error information along to the
master unit.

The four independent serial links on each Transputer supported bidirectional asynchronous
communication at TTL levels, and operated concurrently with the integer and floating
point engines. Internal to the Transputer chip, the links began and ended at DMA engines
that operated concurrently with calculations. Thus, it was possible for a Transputer to be
executing an integer and a floating point operation, and sending/receiving on 4 channels
simultaneously.

The links implemented a point-to-point protocol, with a separate 32x32 crossbar switch
chip (C004) available. Virtual channel architecture could be implemented on the T-800
series, with hardware support in the T-9000 series. T-800 links supported 20 Mbps
bidirectional communication on a 3-wire medium. Messages were transferred as a series of
bytes, with a byte acknowledge. There was a 3-bit per byte overhead on transmission, and
the acknowledge packet was two bits in length. The link adapter allowed the interface of
8-bit bidirectional data to the Transputer link protocol, and thus served as a custom UART.
It was applicable where interfacing to other communication standards such as MIL-STD-1553
was desired. The crossbar switch allowed the connections among Transputers to be
configurable instead of hardwired. It served essentially as a telephone exchange between
Transputers.
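
The framing figures above allow a back-of-envelope estimate of usable link bandwidth.
The sketch below assumes that each data byte costs 8 + 3 = 11 bit times, that each
acknowledge costs 2 bit times on the opposite wire, and that nothing else limits the
link. The function name is hypothetical, and the results are upper bounds under these
assumptions, not measured figures.

```python
DATA_BITS = 8
OVERHEAD_BITS = 3      # per-byte framing overhead quoted in the text
ACK_BITS = 2           # length of the acknowledge packet quoted in the text

def payload_megabytes_per_second(link_mbps, bidirectional):
    """Upper bound on payload rate in one direction of a link, in MB/s.

    With traffic in one direction only, a wire carries just data packets.
    With traffic in both directions, each wire also carries the
    acknowledges for the bytes arriving on the other wire.
    """
    bits_per_byte = DATA_BITS + OVERHEAD_BITS
    if bidirectional:
        bits_per_byte += ACK_BITS
    bytes_per_second = link_mbps * 1_000_000 / bits_per_byte
    return bytes_per_second / 1_000_000

one_way = payload_megabytes_per_second(20, bidirectional=False)
two_way = payload_megabytes_per_second(20, bidirectional=True)
print(f"one direction busy: {one_way:.2f} MB/s")
print(f"both directions busy: {two_way:.2f} MB/s each way")
```

On a 20 Mbps link this gives about 1.82 megabytes/second when only one direction carries
data, and about 1.54 megabytes/second each way when both directions are busy.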

Floating point 



The floating point unit of the T80x series used the IEEE 754 format, and provided
operations on 32- or 64-bit data. These functions included add, subtract, multiply, divide,
format conversion, comparison, and square root. All rounding modes of the standard
were supported. Operation of the unit was microcoded, and redundant sets of three 
registers organized as stacks were included. An error flag could be set by the unit, and 
read by the integer processor. 



Cache 

The Transputer did not make use of cache, having memory on-chip as part of the normal
memory space. The T-9000 allowed some of its internal memory to be treated as
cache.

Memory management 

The Transputer did not implement memory management. It utilized a flat 32-bit address 
space. 

Exceptions 

Interrupts or exceptions were termed events in the Transputer, and a single request pin
and acknowledge pin were available. Associated with an event was a process (handler) that
was scheduled when the event occurred. Latency was typically 19 cycles, with a maximum
of 78 cycles (58 without floating point operations). This assumes no high priority task
was active at the time of the event. Otherwise, the event handler, itself a high priority
task, had to await the completion of the current high priority task.
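
The cycle counts above convert to time once a clock rate is chosen. The sketch below
assumes a 20 MHz processor clock, which is an illustrative value rather than one given
in this section, and the function name is hypothetical.

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Convert a processor cycle count to microseconds at the given clock."""
    return cycles / clock_mhz

CLOCK_MHZ = 20  # assumed clock rate, for illustration only

for label, cycles in [("typical", 19),
                      ("maximum without floating point", 58),
                      ("maximum", 78)]:
    microseconds = cycles_to_microseconds(cycles, CLOCK_MHZ)
    print(f"{label}: {microseconds:.2f} microseconds")
```

At that assumed clock the typical case comes to 0.95 microseconds, consistent with the
sub-microsecond interrupt response quoted earlier, and the worst case to 3.9 microseconds.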

Software 

All Transputer instructions were 1 byte in size, with an operate instruction used to
expand the repertoire to 145 instructions. The subset of 31 single byte instructions was
used about 80% of the time. An on-chip instruction queue handled four byte-sized
instructions fetched simultaneously from memory over the 32-bit bus. Contrary to usual
RISC practice, microcoding was used to decode instructions. The instruction format
was a 4-bit operation code, followed by a 4-bit data value. In this scheme, 13 of the
4-bit codes were assigned to important functions such as load, store, add, and jump. Two
more codes were used in conjunction with the operand register. The last opcode was an
operate instruction, which specified 16 operations on the top of the stack. Up to 70% of
encoded instructions were single byte in this scheme, and most required just one cycle to
execute. The integer unit implemented the four basic math operations, as well as left and
right arithmetic shifts and rotates. The instruction set also included load and store,
conditional jump, the logical and, or, xor, and not, bit reversal and extension, and the
stack operations push and pop. Floating point instructions included floating load and
store, and operate instructions, implicitly referencing the top of the floating point
stack. I/O instructions were organized as link commands to input or output a byte, a
word, or an arbitrary length message. In the T805, block move instructions were included
to address high speed graphics applications. Also, CRC calculation instructions were
included, as well as bit count features.
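
The way a 4-bit data field and an operand register combine to form larger operands can
be sketched in Python. The two prefixing codes are given here the names PFIX and NFIX,
and the function code values (0x2, 0x6, and 0x4 for load constant) are recalled from the
published Transputer instruction set rather than taken from this text, so treat them as
assumptions to be checked against the Inmos manuals. The helper names are hypothetical.

```python
PFIX = 0x2   # prefix: shift the operand register left by 4 bits
NFIX = 0x6   # negative prefix: complement the operand register, then shift
LDC = 0x4    # load constant, used here as the example instruction

def encode(function, operand):
    """Return the byte sequence for one instruction with any integer operand."""
    low_byte = bytes([(function << 4) | (operand & 0xF)])
    if 0 <= operand < 16:
        return low_byte                      # fits in the 4-bit data field
    if operand >= 16:
        return encode(PFIX, operand >> 4) + low_byte
    return encode(NFIX, (~operand) >> 4) + low_byte

def decode(code):
    """Return (function, operand) pairs, folding prefix bytes into operands."""
    result = []
    operand_register = 0
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        operand_register |= data             # data field enters the low 4 bits
        if function == PFIX:
            operand_register <<= 4
        elif function == NFIX:
            operand_register = (~operand_register) << 4
        else:
            result.append((function, operand_register))
            operand_register = 0             # cleared after every real instruction
    return result

print(encode(LDC, 0x754).hex(" "))
print(decode(encode(LDC, -17)))
```

In this sketch a small constant costs one byte, while loading hexadecimal 754 takes two
prefix bytes followed by the load constant byte, which is how short programs stay compact
without limiting the operand range.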

The CPU of the Transputer had 3 registers, organized as a stack. Similarly, the floating
point unit had a 3-register stack. The three values on the stack provided a triadic operand
for the opcode. It was up to the compiler to ensure that no more than 3 elements were on 
the stack at any given time. The integer stack was used for operand address generation for 
the floating point unit as well as integer operands. The floating point stack was 
duplicated, allowing fast context switching because it did not need to be saved and 
restored. 
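
A three-register evaluation stack of this kind can be modeled in a few lines of Python.
The register names a, b, and c and the class below are illustrative assumptions, not the
hardware's definition; the point is that a fourth push silently discards the oldest
value, which is why the compiler has to keep the depth at three or less.

```python
class EvalStack:
    """Three registers used as a push-down stack; a is the top."""

    def __init__(self):
        self.a = self.b = self.c = 0

    def push(self, value):
        # The previous contents of c are lost on every push.
        self.c, self.b, self.a = self.b, self.a, value

    def pop(self):
        value = self.a
        self.a, self.b = self.b, self.c
        return value

    def add(self):
        right, left = self.pop(), self.pop()
        self.push(left + right)

    def multiply(self):
        right, left = self.pop(), self.pop()
        self.push(left * right)

# Evaluate (2 + 3) * 4 without ever holding more than two values.
stack = EvalStack()
stack.push(2)
stack.push(3)
stack.add()
stack.push(4)
stack.multiply()
print(stack.a)
```

An expression that would need more than three live values has to be reordered, or have
intermediate results stored in workspace memory, before it can be evaluated this way.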

Besides the stack registers, the CPU included a workspace pointer to local variable
memory, an instruction pointer (program counter), and an operand register. The
Transputer addressed memory with a 32-bit address. The 4 kilobyte internal memory was
located at hexadecimal address 80000000, and could be accessed in one machine cycle.
External memory access required 3 or more cycles. The byte ordering was little endian.

The Transputer had been available in a commercial chipset for enough time for an 
installed base of applications and software to emerge. Compilers included Ada,
parallel C and C++, Fortran, Pascal, Modula-2, Prolog, Occam, Forth, LISP and others. A
cross-assembler was available, but rarely used. Specialized debug tools for the parallel
environment were emerging from companies such as Inmos and Logical Systems Corp.
Tools were available in the PC and Unix environments. Inmos's integrated development
environment included a folding editor, a compiler, and a debugger. It was hosted on
PC-based systems. A Unix for the Transputer was available, as was Linux.

The Occam Language 

A full discussion of the theory and implementation of the Occam language is beyond
the scope of this book. To gain a full understanding of the Transputer, you have to
understand the background of Occam. To use the Transputer at a basic level, an in-depth
understanding is not necessary.

Occam provides the conceptual framework, and the tools for programming parallelism. A 
discussion of the degrees of parallelism is in order. Superscalar machines exploit 
independent execution of multiple execution units to achieve a low level parallelism, and 
break the one instruction per clock limit per package. Certain classes of problems 
decompose easily into autonomous subtasks for simultaneous execution on vector 
machines. Vectorizing compilers ferret out this latent parallelism from inherently
sequential processes. Occam forces us to program in parallel, a mindset switch that does
not come easily, but is worth the effort. Explicit parallelism in the instantiation of the
program leads to the best results, but existing languages such as Pascal and C may be
extended with parallel constructs to ease the programmer's transition to this new paradigm.



Granularity refers to the size or level of the pieces into which we decompose the parent
process. The level of granularity affects the computation to communication ratio.

A ten person-year task is done by 1 person in ten years, but can't necessarily be done by 
10 persons in 1 year, or by a staff of 3,650 in one day. Coarse-grained parallelism (an 
example is the Mandelbrot set calculation) refers to processes that are largely 
independent, and require little or no communication. Fine grained decomposition results 
in more and smaller portions in greater level of detail, with correspondingly higher need 
for communication. For example, attempting to decompose the Mandelbrot set 
calculation below the processor per pixel limit would have separate processors for the 
real and imaginary parts of the equation, with a need for communication bandwidth 
sufficient to form the absolute value of the complex number for comparison against a 
limit. 

Problems have an intrinsic granularity that maps best to the processor topology. Then, the
communication and interaction between processes must be determined. Generally, as the
granularity becomes finer, the need for communication increases. In Occam, a channel is
the communication mechanism between processes. Among processes on one Transputer,
the number of channels is unlimited. Between Transputers, the channels are mapped to
the four available hardware links.

The name Occam is traced to the 14th century William of Occam, whose
principle of Occam's razor, literally translated from the Latin, says "entities must not be
multiplied beyond what is necessary", or, in the vernacular, "KISS: keep it simple, stupid".
What William was trying to say is, essentially, that of two or more solutions or
approaches, always choose the simplest. In the language Occam, an independent task is a
collection of simple or atomic tasks and events. A process is mapped to one or more
processors.

Concurrent processes are completely independent, can run simultaneously, and have
no shared variables. Processes communicate via channels. Channels map to links, and
processes are implemented as software entities on Transputers.

Occam is a language that makes the description of the parallelism of the problem easier.
It is a structured system description language. It has many features of popular
programming languages, but extends these with the PAR constructor, which says to
execute operations in parallel. Occam is the language that implements the concept of
Communicating Sequential Processes. The Occam model of concurrency uses processes
that run independently, but communicate with other processes. Occam can also
implement simple sequential processes. Parallelism is then like a pipeline of processes.

The Transputer hardware instantiates the Occam concurrency model. This leads to an 
architecture that is ideal for real-time control applications. Occam allows for the easy 
description of parallelism. 

In the Transputer world, a channel is a point-to-point communication path that is
unidirectional, unbuffered, and synchronized.
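
The behavior of such a channel, and of running processes in parallel, can be imitated in
Python with threads. This is only a sketch of the concept: the Channel class and the par
function are hypothetical names, threads stand in for Transputer processes, and none of
it is Occam syntax or the hardware mechanism.

```python
import threading

class Channel:
    """One-way, unbuffered channel: send blocks until the value is received."""

    def __init__(self):
        self._item = None
        self._send_lock = threading.Lock()      # one sender at a time
        self._filled = threading.Semaphore(0)   # signals "a value is waiting"
        self._taken = threading.Semaphore(0)    # signals "the value was taken"

    def send(self, value):
        with self._send_lock:
            self._item = value
            self._filled.release()
            self._taken.acquire()               # wait for the receiver

    def receive(self):
        self._filled.acquire()                  # wait for a sender
        value = self._item
        self._taken.release()
        return value

def par(*processes):
    """Run the given callables concurrently and wait for all of them."""
    threads = [threading.Thread(target=process) for process in processes]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def demo():
    channel = Channel()
    received = []

    def producer():
        for n in range(5):
            channel.send(n * n)
        channel.send(None)                      # sentinel: no more data

    def consumer():
        while True:
            value = channel.receive()
            if value is None:
                break
            received.append(value)

    par(producer, consumer)
    return received

print(demo())
```

The two processes share no variables for communication; every value passes through the
channel, and the sender cannot run ahead of the receiver because each send waits for the
matching receive.
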

The strength of the Transputer lies in its computation/communication balance, and its
inherent suitability for parallel operation. Networks of Transputers, connected by their
links, could be used in parallel processor fashion to tackle large problems.

The Occam language implements a simple syntax for parallel processes, and for
communication between processes. Multiple processes could be implemented on one
Transputer, or on multiple Transputers; the communication method at the software level
was the same. On one Transputer there is virtual concurrency, and on multiple
Transputers there is real concurrency of processes. The Transputer does have an
assembly language, but its use was discouraged by Inmos, which took the view that the
Transputer hardware implements Occam, concurrency in the Transputer 

Concurrent processes were executed using a linked list approach. A process could be 
active or inactive. An active process could be executing or awaiting execution. An 
inactive process, which consumes no processor time, may be waiting for I/O resources, or 
for a particular time to execute. 
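
A much simplified model of this scheduling scheme can be written with Python generators.
The sketch assumes two queues of active processes, one per priority level, with high
priority work always chosen first and low priority processes taking turns; a deque stands
in for the linked list, and the names run and steps are hypothetical. Inactive processes
waiting on I/O or a timer are not modeled.

```python
from collections import deque

def steps(count):
    """A process that needs the given number of scheduling turns to finish."""
    for _ in range(count):
        yield

def run(high, low):
    """Run (name, process) pairs; return the names in the order they executed."""
    high_queue, low_queue = deque(high), deque(low)
    trace = []
    while high_queue or low_queue:
        # High priority processes are always served before low priority ones.
        queue = high_queue if high_queue else low_queue
        name, process = queue.popleft()
        try:
            next(process)                    # let the process run one turn
        except StopIteration:
            continue                         # finished: drop it from the queue
        trace.append(name)
        queue.append((name, process))        # back of the queue, awaiting execution
    return trace

trace = run(high=[("H", steps(2))],
            low=[("A", steps(2)), ("B", steps(1))])
print(trace)
```

With these inputs the high priority process H runs both of its turns first, after which
A and B alternate until each has finished.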

Projects as Case Studies

The following two studies present details of real-world applications that Transputers
could address. The author participated in both. The 64-node parallel processor built with
custom boards at Loyola College by students implemented a Massively Parallel Machine
at low cost. The use of Transputers in their rad-hard form showed major promise for
spaceflight systems.

The Spacecraft Supercomputer Project 

This work was supported in part by NASA/Goddard Space Flight Center during 1991.

The goal of the effort was to define an architecture with an order of magnitude performance
increase over existing spacecraft onboard computing resources. This goal was exceeded
by exploiting scalable parallel architectures.

This study established the feasibility of using off-the-shelf hardware with a known path
to full space qualification to address a large set of user processing needs in the area of
sensed data sets for Earth observation, including weather data. It is no longer feasible to
collect large volumes of data to be downlinked over already strained communication
channels to large central archive and processing centers based on old-generation
computational resources. Now, the choice is between not collecting the data and
generating useful data products close to the data source, and directly downlinking these to
users in the field, in addition to and in parallel with the classical data collection and
downlink tasks. In many cases, the user's need was for the receipt of timely data
products directly at a remote site. This is not cost effective or even feasible for most
current or planned data processing facilities for downlinked sensed data.






One major focus of the High Performance Computing and Communications (HPCC) 
Program is "in making parallel computing easier to use and scalable..." NASA's role in 
HPCC is "to accelerate the development and application of high performance computing 
technologies to meet NASA's science and engineering requirements." This applies to 
NASA's ground based systems, and also to flight systems, which lag further and further 
behind. Application of these techniques involves a major effort in system design. The
application of new architectures involves the development of new toolsets, software, and
paradigms.

This effort focused on extracting information from raw data onboard the spacecraft at the 
sensor, and established the feasibility of generating data products onboard for direct 
downlink in a timely manner by using proven hardware, architecture, and algorithms. 
This approach does not preclude or interfere with the normal collection and archiving of 
sensed data, but rather works with data tapped off the main stream, and processed in a 
parallel stream. 

Scalable parallel processing techniques are applicable to a large set of spacecraft onboard 
processing tasks now and in the immediate future. This application will provide the 
capability to generate and rapidly distribute data products that cannot otherwise be done. 

At the time, the Transputer was the best candidate for implementation of this approach. 
The Transputer is supported by an Ada compiler from Alsys Corp, as well as numerous 
other languages with parallel extensions. 

Although the goal of this study was to define an architecture with an order of magnitude 
performance increase over existing onboard computing resources, it was shown that 
several orders of magnitude were feasible. With scalable processor/communication 
resources, the hardware could be more appropriately matched to the problem domain, 
while retaining redundancy and reprogrammability. 

The early phases of the study identified a series of candidate science payloads or
instruments that the Flight Supercomputer could provide services to. The key goal in this
phase was to identify a real application that could be implemented without impacting the
instrument schedule or mission success, but that would allow the collection of data that
would otherwise be lost. An observing class instrument was preferred, as it would
provide a large data source.

Requirements for throughput and processing were derived from EOS instrument-class 
data. These data provided a strawman set of requirements. Numerous applications were 
found that could benefit from the utilization of parallel processor technology. These all 
revolved around the onboard generation and direct downlink of data products of local 
interest in a timely fashion to end users in the field. In most cases, the timeliness issue 
precluded the downlink of data to a classical ground processing facility for data product 
generation, followed by dissemination, possibly by re-uplink and rebroadcast. The 
complexity of the onboard system interacts with the complexity, and thus cost, of the 
ground station equipment. This introduced the concepts of data 
compression/decompression for transmission. The ground receiving station, exclusive of 
the RF portion, was considered to be a laptop class computer, augmented with a front-end 
processor, probably based on a complementary scalable parallel processor. This front-end 
processor would be a custom designed box. In most cases, the ground based application 
would be cost sensitive. 

Space Computer Corp. of California was under contract to DARPA to produce a 
"Miniaturized, Low-Power Parallel Processor". Their approach had been to use the 
Transputer as a communication element for Vector co-processors. A prototype system for 
guided missile applications was delivered in April 1990, and provided a peak processing 
throughput of 1.3 Gflops. Current efforts focus on microminiaturization of the 
technology, using custom designed ASICs and wafer scale integration. 

The applications for the resulting device include sensor image processing and synthetic 
aperture radar (SAR) processing, including image compression tasks. The SCC-100 is a 
multinode device, with each node consisting of a Transputer, memory, and Zoran vector 
signal processor chips. The Zoran chips provided the computational throughput, and the 
Transputers provided communications and control. Flight units will require the 
availability of rad-hard, Mil-spec die. 

There are potential applications of the Flight Supercomputer to data processing 
requirements derived from Earth observing class spacecraft, including weather satellites. 

The Earth Observing System (EOS) platforms have two direct broadcast channels, one 
supporting a 15 Mbps rate, and the other supporting 100 Mbps. The high rate channel is 
nominally dedicated to one instrument, but can provide a backup to the nominal TDRSS 
high rate link. The EOS platform instrument set is still not completely defined, but a 
representative set was used to determine if the Transputer link I/O was capable of 
supporting the collection of data. Processing of data was not considered, since the 
algorithms were yet undefined. 

In all cases, the data input capacity of a single link on the T-800 Transputer is sufficient 
to handle the average data rate of the instrument. A single link is also sufficient for the 
peak rate for all but the HIRIS and ITIR. With 4 links, the Transputer can input more 
than 1 instrument stream continuously, and with external multiplexing, can handle the 
entire instrument set. Using data from an early EOS-A instrument set, the 
instruments' data could be handled by a T-800 Transputer link, with the exception of the 
HIRIS instrument in peak mode, which output data at 160 Mbps. The EOS 
instrument-derived requirements enveloped the instrument data rates for Earth observing 
missions in the near future. 
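
As a rough check on the link sizing above, the number of links a data stream needs can be 
estimated by dividing its rate by the link rate. The sketch below is illustrative only: the 
20 Mbps figure is an assumed nominal link signaling rate (protocol overhead is ignored, so 
usable throughput would be lower), and the function names are hypothetical. Only the 
160 Mbps HIRIS peak rate is taken from the text. 

```python
import math

# Assumption: nominal link signaling rate in Mbps; protocol overhead is ignored.
LINK_RATE_MBPS = 20.0
# From the text: the Transputer has 4 links.
LINKS_PER_TRANSPUTER = 4


def links_needed(rate_mbps, link_rate_mbps=LINK_RATE_MBPS):
    """Return the number of links needed to carry a stream of the given rate."""
    return math.ceil(rate_mbps / link_rate_mbps)


def fits_one_transputer(rate_mbps):
    """True if the stream fits on the links of a single Transputer."""
    return links_needed(rate_mbps) <= LINKS_PER_TRANSPUTER


hiris_peak_mbps = 160.0  # HIRIS peak-mode rate quoted in the text
print("links needed for HIRIS peak:", links_needed(hiris_peak_mbps))
print("fits on one Transputer:", fits_one_transputer(hiris_peak_mbps))
```

Under these assumptions the HIRIS peak stream exceeds what one device can accept directly, 
which is consistent with treating it as the exception that needs external handling. 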

Numerous applications exist in the field of Earth Resources mapping, particularly where 
asynchronous events directly affect human activities, or require timely response. In many 
cases, the required data product calculation and distribution must be performed at the data 
source. This implies the capability of onboard processing of sensed data, and direct 
downlink of the resultant data products, in parallel with the normal data downlink. It is 
anticipated that several algorithms could reside in onboard memory, and that code could 
be uplinked rapidly to implement new or modified algorithms in response to 
unanticipated events. 

One potential data product for onboard calculation involves the determination of the 
perimeter of a forest fire, which is a classic edge-detection problem, given a thermal band 
image. Direct downlink to the forest fire command center is essential for freshness of the 
data. The on-site equipment must be small, lightweight, and inexpensive, so that it can be 
air-dropped and abandoned if necessary. 

The data products of interest to the fire management crews on the ground include the 
flame front location, direction and rate of spread, smoke plume dimension, and hot spot 
detection. These data are used for personnel and equipment placement, logistics, and 
safety. Observations from U-2 aircraft as well as TIROS series weather satellites have 
been used to gather relevant data on fires. 

Perimeter determination can be accomplished with a Laplacian operator, which is 
homogeneous. This is followed by a scaling, a full rectification, and a thresholding. This 
can be done on a single Transputer at the data rate. The onboard data storage 
requirements can be minimized by clever organization of the algorithm. Data storage is a 
premium item that must be minimized for space flight use. Not only are the storage 
devices expensive, but they also consume resources such as size, weight, and onboard 
power. 
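
The sequence described above (Laplacian, scaling, full rectification, thresholding) can be 
sketched in a few lines. This is a minimal illustration, not the flight algorithm: the 
4-neighbor form of the Laplacian, the function name, and the scale and threshold values are 
all assumptions, and the image is a plain list of rows rather than a real thermal band 
product. 

```python
def laplacian_perimeter(image, scale=1.0, threshold=1.0):
    """Return a binary edge map (list of rows of 0/1) for a 2-D list of values.

    Border pixels are left at 0 because the operator needs all four neighbors.
    """
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # 4-neighbor Laplacian (an assumed kernel choice).
            lap = (image[r - 1][c] + image[r + 1][c]
                   + image[r][c - 1] + image[r][c + 1]
                   - 4 * image[r][c])
            # Scaling followed by full rectification (absolute value).
            value = abs(scale * lap)
            # Thresholding produces the binary perimeter map.
            edges[r][c] = 1 if value >= threshold else 0
    return edges


# Synthetic "hot" 3x3 region in a cold 7x7 scene.
scene = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        scene[r][c] = 10

for row in laplacian_perimeter(scene, scale=1.0, threshold=5.0):
    print(row)
```

Because each output pixel depends only on the rows directly above and below it, a streaming 
version needs to hold just three image rows at a time, which is one way the storage 
requirement could be kept small. 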

Similar to the determination of the perimeter of a fire is the determination of the area of 
an oil slick on a body of water. In this case we are interested in the perimeter, but can 
also determine the extent of the "blob", and track the shape, location, and drift. Downlink 
data can be sent directly to the site, and used to position containment equipment. The oil 
slick identification problem maps easily to the problem of determining the extent of the spread 
of floodwaters. 
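
Once a scene has been reduced to a binary mask, the extent and drift of a "blob" follow from 
simple pixel counting. The sketch below assumes a mask given as a list of rows of 0/1, a 
hypothetical per-pixel ground area, and hypothetical function names; it tracks drift as the 
movement of the centroid between two frames. 

```python
def blob_extent(mask, pixel_area=1.0):
    """Return (area, centroid) of the set pixels in a binary mask.

    The centroid is a (row, column) pair, or None if no pixel is set.
    pixel_area is an assumed ground area per pixel in arbitrary units.
    """
    count = 0
    sum_r = 0
    sum_c = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                count += 1
                sum_r += r
                sum_c += c
    if count == 0:
        return 0.0, None
    return count * pixel_area, (sum_r / count, sum_c / count)


def drift(centroid_a, centroid_b):
    """Return the (row, column) displacement from centroid_a to centroid_b."""
    return (centroid_b[0] - centroid_a[0], centroid_b[1] - centroid_a[1])


frame1 = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
frame2 = [[0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]

area1, center1 = blob_extent(frame1)
area2, center2 = blob_extent(frame2)
print("area:", area1, "drift:", drift(center1, center2))
```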

Another related problem is the timely determination of the location of schools of fish near 
the continental shelves, using ocean colorimetry. This process is of interest to commercial 
fishing fleets, and the timeliness of the information is essential, requiring direct downlink 
to fishing fleets. Cost of the fleet equipment is also an issue. 

Similar applications involve specialized operations to image active volcanoes, or to locate 
and track the eye of severe storms (hurricanes or typhoons). All of these processes are 
classical image processing applications that can be hosted on one or several Transputers. 
In most cases the core algorithm is less than 100 lines of code, but is applied across a real 
time data set. In this case, a systolic pipeline of Transputers is the ideal topology. 
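
The data flow of such a pipeline can be modeled with chained Python generators, where each 
generator stands in for one processor and the chaining stands in for a link between 
neighbors. This is only a sketch of the data flow: generators run one at a time rather than 
concurrently, and the stage names, gain, and threshold level are hypothetical. 

```python
def scale(rows, gain):
    """Stage 1: multiply every sample in each row by a gain."""
    for row in rows:
        yield [gain * v for v in row]


def rectify(rows):
    """Stage 2: full rectification (absolute value) of each sample."""
    for row in rows:
        yield [abs(v) for v in row]


def threshold(rows, level):
    """Stage 3: reduce each sample to 1 or 0 against a level."""
    for row in rows:
        yield [1 if v >= level else 0 for v in row]


def pipeline(rows, gain=2.0, level=5.0):
    """Connect the stages in sequence; rows stream through one at a time."""
    return threshold(rectify(scale(rows, gain)), level)


result = list(pipeline([[-3, 1], [4, -0.5]]))
print(result)
```

Each stage holds only the row it is working on, so the arrangement handles an arbitrarily 
long stream of rows with fixed storage per stage. 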

Using successive images of cloud formations, properly registered by visible land mass, 
the wind vectors may be inferred by cloud motion. This process involves multi-spectral 
imaging (visible and infrared) from spacecraft such as SMS and GOES. The derived wind 
information is of importance in global weather pattern understanding and prediction, and 
is critical to severe storm forecasting. Implementing this process onboard will result in 
the ability to generate and downlink a data product that is of timely interest. 

The required derivation of the data on the ground had been difficult and time-consuming, 
and had not been available operationally. The algorithms described in the literature 
postulate quasi-real time implementation on super minicomputers augmented with array 
processors. This process could also be implemented on arrays of Transputers, and may be 
the most promising candidate for further implementation. 

Revisiting that part of the study that looked at the initial assumption of using the 
Transputer, we wanted to see whether it was still valid in light of advancing technology and 
product announcements from other vendors, as well as advances from Inmos. 

A number of alternate computer and connectivity architectures for the flight 
supercomputer were examined to establish a taxonomy of choices and grade these 
according to the realities of schedule and availability. Existing and emerging RISC chips 
such as the MIPS R3000, the Intel i80860 and 960, the Motorola 96000 DSP series, etc. 
were examined to determine if there was a better alternative to the Transputer chip, 
before committing to a processor choice. Of concern were availability, vendor support, 
and software development tools. 

No processor could be found that could provide the same level of connectivity as the 
Transputer, and the Transputer was ahead in terms of Processor-I/O balance. The current 
unavailability of the unit in space-qualified versions is the only drawback. The closest 
second candidate was the recently announced Texas Instruments 'C40 Parallel DSP, 
which is marketed as a DSP, not a general purpose computer. Because TI had a history of 
developing Military versions of its products, it is worthwhile to continue to track its 
development. The communication architecture of the C40 uses parallel ports with DMA 
engines, and is necessarily distance-limited. Data on the 'C40 only recently became 
available, and it was not possible to completely evaluate it for the purposes of this study. 
No other architecture could be found that provided the Transputer's inherent 
communication capability and connectivity, without extensive glue-logic. In fact, many 
emerging systems use the Transputer as a communication element for fast RISC 
processors such as Intel's i860 or Motorola 96000 series. The remainder of this section 
discusses some of the other RISC architectures that were examined. 

The R3000 and Intel 80960 family were selected for the 32-bit follow-on to the 1750A 
architecture for military avionics by the JIAWG. However, neither provides a connectivity 
solution comparable to the Transputer. The i80386 is being qualified for use on Space 
Station Freedom and with the Flight Telerobotic Servicer, but that chip is not designed 
for multiprocessing. 

As with any space application, the usability of parts will lag the commercial state of the 
art by 3-10 years. Any new architecture suggested must be available in the correct 
package, and available with the applicable process screening. 

Examining the general problem domain for onboard processing of sensed data, the full 
spectrum from matrix multiplication (compute-bound) to matrix addition (I/O bound) is 
seen. Compute bound processes can always be sped up by faster computational 
components, faster memory, or a smarter architecture. I/O bound problems present a 
greater challenge. In fact, it is relatively easy to transform a compute bound problem to 
an I/O bound problem with a parallel processor. One solution, studied at Carnegie-Mellon 
University, is the systolic processor, which is a matrix of simple, interconnected 
processor elements with I/O at the boundaries, and a pipelined processing approach to 
data that is pulsed through the array. In this scheme (since instantiated in the iWarp 
product, by Intel), multiple use can be made of each data item, and a high throughput can 
be achieved with modest I/O rate. There is extensive concurrency and modular 
expandability, and the control and data flow are simple and regular. This technique, 
easily implemented on Transputers, lends itself well to repetitive operations on large data 
sets, such as those generated by spaceborne sensors. 
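
The contrast between the two ends of this spectrum can be made concrete by counting 
arithmetic operations per word moved. The sketch below assumes n x n matrices, that each of 
the two inputs and the one output is transferred exactly once, and a hypothetical function 
name. Multiplication then performs n^3 multiplies and n^2(n-1) adds, while addition performs 
only n^2 adds on the same amount of data. 

```python
def ops_per_word(n, kind):
    """Arithmetic operations per word transferred for n x n matrices.

    Assumes two input matrices and one output matrix, each moved once.
    """
    words = 3 * n * n
    if kind == "multiply":
        ops = n ** 3 + n * n * (n - 1)  # multiplies plus adds
    elif kind == "add":
        ops = n * n                     # one add per output element
    else:
        raise ValueError("kind must be 'multiply' or 'add'")
    return ops / words


for n in (8, 64, 512):
    print(n, ops_per_word(n, "multiply"), ops_per_word(n, "add"))
```

The ratio for multiplication grows with n, so each data item can be reused many times once it 
is inside the array. The ratio for addition stays fixed at one third regardless of size, so 
added compute power does not help unless the I/O rate rises with it. 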

Scalable systems, those made up of multiple computational/communication building 
blocks, have an architecture that is responsive to the problem domain. In such a 
homogeneous system, the correct amount of processing and I/O can be provided for the 
initial requirements, with the ability to expand later in a building block fashion to address 
evolved requirements as well as redundancy or fault tolerance. Developing software for 
scalable systems is a challenge, mostly in deciding how the software is spread across the 
computational nodes. This is a solvable problem, based both on good software tools and 
on programmer experience. Research into these topics, as well as the ability of the system 
itself to adapt to processing load, is ongoing. 

Of course, the applicability of the parallel processor to a given problem set implies that 
the applicable algorithm can be parallelized, and a solution can be implemented and 
debugged in a reasonable time. This implies that an efficient programming and 
debugging environment exists for the selected hardware. This is certainly the case for 
Transputer-based systems. The major hurdle is conceptual for the systems integrators - 
the ability to think in parallel paradigms. This comes with hands-on experience. 

Design & Construction of Loyola College's 64 Node Parallel Processor 

This section describes the design & construction of a 64 node parallel processor based on 
the Inmos Transputer, for the graduate engineering program at Loyola College. This 
project provided valuable hands-on experience to staff and students, and resulted in an 
institutional resource for other programs, such as for the development of 
software courses. 

In June 1992, Loyola had most of the parts needed to integrate a 64 node parallel 
processor at the Columbia lab, thanks to generous donations from Inmos Corp. We had 
on hand sufficient T414 Transputer processors, over 500 pieces of SRAM memory, 
multiple copies of the development software system, and numerous technical reports and 
data books. 

At the lab at the Columbia campus, we had set up a Transputer development workstation, 
hosted in a PC, that allowed us to develop code for the parallel machine in C, Pascal, 
FORTRAN, or Occam. This system, which initially consisted of a single Transputer, was used 
for student projects. Later, we duplicated this development system. 

Besides the experience gained by building and demonstrating this unit, other departments 
at Loyola were seen to benefit. Engineering Science was providing a resource that others 
could use. For example, ongoing projects at Loyola included using parallel processing for 
image processing research, using Transputers for embedded robot control, and setting up 
a parallel processing course. 

Our target was to do this project at minimum cost, using existing and donated resources, 
and student labor. Engineering Science would then provide the use of the machine as a 
resource to students in our program, and for other departments. The parts budget was 
$4000. 

A 4-node Transputer board was prototyped and designed by a Loyola alumnus. The 
4-node board was designed to be used as four 1-node boards, or as a single 4-node 
configuration. We originally planned to procure several Augat-style wire-wrap boards. Each 
node of the parallel processor machine would require a Transputer chip, a clock source, 
memory, and several "glue" chips. A design for the node was completed. A node was estimated 
to take 3 hours to wire-wrap, and 2 hours to check out and debug. Each Augat card could hold 
8-10 nodes. However, experience with wire-wrapped Transputer systems showed 
significant electrical noise and interference problems with the memory interface. The 
circuit board approach was chosen instead. Sixteen of the boards, plus two spares, were 
produced to house the donated Transputers. 

Similarly, we originally planned to make use of the donated SRAM for the system 
memory. Each node was to have, at a minimum, 256k bytes of memory, for a total of 16 
Megabytes of system memory. We did not have, however, enough pin-compatible 
memory to use for the 16 boards, without doing several derivative designs. Thus, we 
switched to purchased SRAM chips for the design. The SRAM chips were the most 
expensive procured component of the system. In a Transputer system, the memory must 
be 32 bits wide, forcing use of 4 pieces of byte wide memory. 
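
The memory figures above follow from simple arithmetic, sketched below. The node count, 
per-node memory, and 32-bit width come from the text. The assumptions are that a Megabyte 
means 2^20 bytes and that each node uses a single bank of four byte-wide parts, which is not 
stated in the text; the function name is hypothetical. 

```python
def memory_plan(nodes=64, bytes_per_node=256 * 1024,
                word_bits=32, chip_width_bits=8):
    """Return a dict of derived memory figures for the whole machine."""
    chips_per_node = word_bits // chip_width_bits
    return {
        "chips_per_node": chips_per_node,
        # Depth of each byte-wide part, assuming one bank per node.
        "bytes_per_chip": bytes_per_node // chips_per_node,
        "total_chips": nodes * chips_per_node,
        "total_megabytes": nodes * bytes_per_node / 2 ** 20,
    }


for name, value in memory_plan().items():
    print(name, value)
```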

The quad board was able to use a single 5 MHz clock oscillator for all four nodes on the 
board. The Transputers' links were connected to their nearest neighbors, and the spare links 
were buffered off board using TTL drivers. The Inmos standard up, down, and system control 
resources were provided. 

The boards were assembled, populated, and checked out over the summer of 1993 by 
students. All of the boards and chips worked as planned. After unit test, an integrated 
system was built up as interconnect cables were fabricated. 

The system was completed in November of 1993 with the design and construction of a 
power distribution board. Checkout was particularly easy, using the most rudimentary of 
software tools. The complexity of the system was much less than that of an integer 
processor of the same parts count, because in the case of the parallel processor, it was 64 
identical, replicated circuits. Using only the public domain software utilities CHECK and 
MTEST, we were able to debug the hardware in one evening. The software would tell us 
the node that had an error, which mapped to a particular board. For memory problems, 
we had the node and the byte, which mapped to a chip. In most cases a cursory visual 
inspection would reveal a missed solder joint, or an incorrect chip. 
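
The mapping from a reported fault to a physical part can be sketched as follows. This is a 
hypothetical numbering scheme, not the actual output format of CHECK or MTEST: it assumes 
nodes are numbered consecutively with four per board, and that the four byte-wide memory 
chips on a node correspond to the four byte lanes of a 32-bit word. 

```python
NODES_PER_BOARD = 4  # from the text: quad boards
CHIPS_PER_NODE = 4   # from the text: 4 byte-wide parts per 32-bit word


def locate_node(node):
    """Return (board, position on board) for a node number (assumed numbering)."""
    return divmod(node, NODES_PER_BOARD)


def locate_chip(node, byte_address):
    """Return (board, position, byte lane) for a failing byte address."""
    board, position = locate_node(node)
    lane = byte_address % CHIPS_PER_NODE  # byte lane within the 32-bit word
    return board, position, lane


print(locate_node(37))
print(locate_chip(37, 0x80000006))
```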

As of 1994, a card cage was being fabricated for the machine, to protect the board 
interconnect cables. Plans were being made to connect the parallel processor's host 
machine, an 80386, to the network, and thence to the Internet. A graduate level course on 
parallel programming was proposed, based on the machine. Faculty members were also 
exploring the feasibility of using the machine for code previously run on a Cray. Studies 
of the SPRINT-2 architecture, a similar system using 64 Transputers, showed it to have 
an equivalent speed of execution to that of a Cray Y-MP. 